A genome assembly of the American black bear, Ursus americanus, from California

Abstract The American black bear, Ursus americanus, is a widespread and ecologically important species in North America. In California, the black bear plays an important role in a variety of ecosystems and serves as an important species for recreational hunting. While research suggests that the populations in California are currently healthy, continued monitoring is critical, with genomic analyses providing an important surveillance tool. Here we report a high-quality, near chromosome-level genome assembly from a U. americanus sample from California. The primary assembly has a total length of 2.5 Gb contained in 316 scaffolds, a contig N50 of 58.9 Mb, a scaffold N50 of 67.6 Mb, and a BUSCO completeness score of 96%. This U. americanus genome assembly will provide an important resource for the targeted management of black bear populations in California, with the goal of achieving an appropriate balance between the recreational value of black bears and the maintenance of viable populations. The high quality of this genome assembly will also make it a valuable resource for comparative genomic analyses among black bear populations and among bear species.


Introduction
The American black bear, Ursus americanus, is an ecologically and economically important species in North America (Fig. 1).Historically, black bears were widely distributed, but loss of habitat has restricted that range, particularly in the United States (Pelton et al. 1999).In California, the species plays a critical role in many ecosystems, while also serving as an important species for recreational hunting (California Department of Fish and Game 1998).While research suggests that the California populations are currently healthy, continued monitoring is critical to developing targeted management plans in order to achieve an appropriate balance between the recreational value of black bears and the maintenance of viable populations across the state (California Department of Fish and Game 1998).
Genomic resources, including high-quality genome assemblies, will provide valuable tools for the assessment of black bear populations.Genomic analyses will enable the development of population-specific management strategies by assessing population connectivity, inbreeding depression, and local adaptation.The results of these analyses will aid managers in maintaining healthy black bear populations across their range.This genome, generated from a sample from California, will be instrumental in understanding genetic variation unique to populations in the western United States and can also be used in pangenomic analyses with existing assemblies to better represent the diversity of black bears throughout their native range.There are publicly available genome assemblies from two samples, both from the eastern United States.One is contig-level (NCBI accession GCF_020975775.1); the other is scaffold-level (GCA_003344425.1,Srivastava et al. 2019).Multiple reference genomes from divergent lineages enable the identification of structural variants, which may play a critical role in local adaptation and population health.
High-quality genome assemblies will also enable comparative genomics analyses across bear species.Recent advances in multiple reference genome alignment have enabled the discovery of genetic characteristics important to species conservation (Wilder et al. 2023), as well as the evolutionary innovations unique to various lineages (Christmas et al. 2023).Ongoing efforts to generate high-quality genome assemblies for all extant bear lineages will enable the identification of deleterious and adaptive genetic variation, both within the lineage and at broader taxonomic levels.
Here we report a high-quality, near chromosome-level genome assembly generated from a California black bear as part of the California Conservation Genomics Project (CCGP; Shaffer et al. 2022).This genome assembly will be a valuable resource for management of the black bears across California and the rest of North America.

Biological materials
We captured and sedated an adult male black bear (L20-20) for relocation in September 2020 at Kings Beach, Placer County, California (39.2377°N, 120.0266°W).California Department of Fish and Wildlife (CDFW) staff captured the bear under the department's jurisdiction as the trustee for wildlife management in the state of California, CA Fish & Game Code § 1802Code § (2015)).While the bear was sedated, CDFW staff collected a whole-blood sample into a tube containing EDTA.

DNA extraction
We isolated high molecular weight (HMW) genomic DNA (gDNA) from the whole blood sample.We added 3 ml of RBC lysis solution (Qiagen, Germany; Cat # 158445) to 1 ml of whole blood and incubated the reaction at room temperature for 5 min.We centrifuged the sample at 2,000 x g for 5 min to pellet white blood cells.We discarded the supernatant and added 2 ml of lysis buffer containing 100 mM NaCl, 10 mM Tris-HCl pH 8.0, 25 mM EDTA, 0.5% (w/v) SDS, and 100 µg/ml Proteinase K to the cell pellet.We incubated the reaction at room temperature for a few hours until the solution was homogenous.We then treated the lysate with 20 µg/ml RNase A at 37 °C for 30 min and cleaned with equal volumes of phenol/chloroform using phase lock gels (Quantabio, Beverly, MA; Cat # 2302830).We precipitated the DNA by adding 0.4× volume of 5M ammonium acetate and 3× volume of ice-cold ethanol.We washed the DNA pellet twice with 70% ethanol and resuspended it in an elution buffer (10 mM Tris, pH 8.0).We assessed the purity of gDNA using a NanoDrop ND-1000 spectrophotometer and observed a 260/280 ratio of 1.85 and a 260/230 ratio of 2.13.We quantified the DNA yield at 15 µg with a Qbit 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA).We verified the integrity of the HMW gDNA on a Femto pulse system (Agilent Technologies, Santa Clara, CA), confirming that 70% of the DNA was in fragments over 100 kb.

PacBio HiFi library preparation and sequencing
We constructed a HiFi SMRTbell library using the SMRTbell Express Template Prep Kit v2.0 (Pacific Biosciences [PacBio], Menlo Park, CA; Cat.#100-938-900) according to the manufacturer's instructions.We sheared HMW gDNA to a target DNA size distribution of 15 to 20 kb using Diagenode's Megaruptor 3 system (Diagenode, Belgium; Cat.B06010003) and then concentrated the sheared DNA using 0.45× of AMPure PB beads (PacBio; Cat.#100-265-900).We then processed the DNA through a series of enzymatic reactions: removal of single-strand overhangs at 37 °C for 15 min, DNA damage repair at 37 °C for 30 min, end repair and A-tailing at 20 °C for 10 min and 65 °C for 30 min, ligation of overhang adapters v3 at 20 °C for 60 min followed by 65 °C for 10 min to inactivate the ligase, and nuclease treatment at 37 °C for 1 h.We then purified and concentrated the SMRTbell library with 0.45× Ampure PB beads for size selection using the BluePippin/PippinHT system (Sage Science, Beverly, MA; Cat #BLF7510/HPE7510) to collect fragments greater than 9 kb, with a resulting average fragment size of 16 kb.We sequenced the HiFi SMRTbell library at UC Davis DNA Technologies Core (Davis, CA) using three SMRT Cell 8M trays (PacBio, Cat #101-389-001), Sequel II sequencing chemistry 2.0, and 30-h movies for each cell on a PacBio Sequel II sequencer.

Omni-C library preparation and sequencing
We prepared an Omni-C library from whole blood (ID:AMBB2020-038-001) using a Dovetail Omni-C Kit (Dovetail Genomics, Scotts Valley, CA) according to the manufacturer's protocol with slight modifications.We crosslinked the chromatin in the nucleus, digested the chromatin with DNase I, repaired chromatin ends and ligated a biotinylated bridge adapter to the repaired ends, reversed the crosslinks, and purified the DNA.We treated purified DNA to remove biotin that was not internal to ligated fragments.We generated a short-read sequencing library using an NEB Ultra II DNA Library Prep kit (New England Biolabs, Ipswich, MA) with an Illumina-compatible y-adaptor.We captured biotin-containing fragments using streptavidin beads.We split the post-capture product into two replicates prior to polymerase chain reaction (PCR) enrichment to preserve library complexity, with each replicate receiving a unique dual index.We sequenced the library at the Vincent J. Coates Genomics Sequencing Laboratory (Berkeley, CA) on the Illumina NovaSeq 6000 platform with an S4 flow cell (Illumina, San Diego, CA).

Nuclear genome assembly
We assembled the genome of U. americanus following the CCGP assembly pipeline version 4.0 (https://github.com/ccgproject/ccgp_assembly). Table 1 lists the software and non-default parameters used in the assembly.First, we removed the remnant adapter sequences from the PacBio HiFi dataset using HiFiAdapterFilt (Sim et al. 2022).We then generated the initial, partially phased, diploid assembly using HiFiasm (Cheng et al. 2022) in Hi-C mode using the adaptertrimmed PacBio HiFi reads and the Omni-C data.Next, we aligned the Omni-C data to the primary assembly following the Arima Genomics Mapping Pipeline (https://github.com/ArimaGenomics/mapping_pipeline) and then scaffolded it with SALSA (Ghurye et al. 2017(Ghurye et al. , 2019)).
We identified the X chromosome in our assembly using synteny with the domestic dog genome.We used Nucmer (Kurtz et al. 2004) to align a domestic dog X chromosome (NCBI RefSeq GCF_011100685.1, scaffold NC_049260.1)to our assembly and examined hits longer than 10 kb with greater than 80% identity.We also attempted to identify the Y chromosome in our assembly using the same process with a domestic dog Y chromosome (NCBI GenBank KP081776.1).

Genome assembly assessment
We generated k-mer counts from the adapter-trimmed PacBio HiFi reads using Meryl (https://github.com/marbl/meryl).We used these k-mer counts in GenomeScope2.0(Ranallo-Benavidez et al. 2020) to estimate genome features, including genome size, heterozygosity, and repeat content.To obtain general contiguity metrics, we ran QUAST (Gurevich et al. 2013).To evaluate genome quality and functional completeness we used BUSCO (Manni et al. 2021) with the Mammalia ortholog database (mammalia_odb10), which contains 9,226 genes.We assessed base-level accuracy and k-mer completeness using Merqury (Rhie et al. 2020) with the previously generated Meryl database.We further estimated genome assembly accuracy using a frameshift analysis of the BUSCO gene set, as described in Korlach et al. (2017).We determined the size of the phased blocks based on the size of the contigs generated by HiFiasm in HiC mode.Following the quality metric nomenclature established by Rhie et al. (2021), we calculated the genome quality code x.y.P.Q.C, where, x = log10[contig NG50]; y = log10[scaffold NG50]; P = log10 [phased block NG50]; Q = Phred base accuracy QV (quality value); C = % genome represented by the first "n" scaffolds, following a karyotype of 2n = 74 (Nash and O'Brien 1987).We calculated these quality code metrics for the primary assembly.
We compared genome statistics for our assembly (mUrsAme1) to two other black bear genome assemblies available: ASM334442v1 (NCBI Genome: GCA_003344425.1) and gsx_jax_bbear_1 (NCBI RefSeq GCF_020975775.1).We generated the contiguity metrics using QUAST and the functional completeness metrics using BUSCO with the Mammalian ortholog database.

Mitochondrial genome assembly
We assembled the mitochondrial genome of U. americanus from the PacBio HiFi reads using the reference-guided pipeline MitoHiFi (Uliano-Silva et al. 2023).We used an existing mitochondrial sequence of U. americanus (NCBI:AF303109.1;Delisle and Strobeck 2002) as the starting reference sequence.We searched for matches of the resulting mitochondrial assembly sequence in the nuclear genome assembly using BLAST+ (Camacho et al. 2009) and filtered out contigs and scaffolds from the nuclear genome assembly with a sequence identity >99% and size smaller than the mitochondrial assembly sequence.We annotated the mitochondrial genome using MitoFinder (Allio et al. 2020).

Results
The Omni-C library generated 130.9 million read pairs and the PacBio HiFi library generated 6.1 million reads.The PacBio HiFi sequences yielded ~38× genome coverage and had an N50 read length of 15,293 bp; a minimum read length of 43 bp; a mean read length of 14,915 bp; and a maximum read length of 52,231 bp (see Supplementary Fig. S1 for read length distribution).Based on the PacBio HiFi data, Genomescope 2.0 estimated a genome size of 2.37 Gb, a 0.17% sequencing error rate, and 0.358% heterozygosity.The k-mer spectrum shows a bimodal distribution with a major coverage peak at ~37× coverage and a minor coverage peak at ~18× coverage (Fig. 2A).
The final genome assembly (mUrsAme1) consists of two partially phased haplotypes.Both assemblies are similar in size, but not equal to the estimated genome size from GenomeScope2.0, as has been observed in other taxa (see Pflug et al. 2020, for example).The primary assembly (mUrsAme1.0.p) consists of 316 scaffolds spanning 2.52 Gb with a contig N50 of 58.85 Mb, a scaffold N50 of 67.55 Mb, the largest contig size of 107.13 Mb, and the largest scaffold size of 122.37 Mb.Given the level of fragmentation of the alternate assembly, we kept it as a contig-level assembly.The alternate assembly (mUrsAme1.0.a) consists of 77,310 contigs spanning 2.88 Gb with a contig N50 of 60.74 kb and the largest contig size of 831.37 kb.The fragmentation of the alternate assembly is likely due to the low heterozygosity of the sampled individual because the alternate assembly represents heterozygous regions of the genome.The primary assembly has a BUSCO completeness score for the Mammalia gene set of 96.30%, a base pair QV of 63.01, k-mer completeness of 98.18%, and a frameshift indel QV of 43.13.The alternate assembly has a BUSCO completeness score for the Mammalia gene set of 62.6%, a base pair QV of 56.97, a k-mer completeness of 75.54%, and a frameshift indel QV of 42.81.
During manual curation, we made 11 joins and 1 break on the primary assembly based on the initial Omni-C contact map.We filtered out 4 contigs corresponding to mitochondrial  contamination, one from the primary assembly and 3 from the alternate assembly.No further contigs were removed.The Omni-C contact map for the final primary assembly shows a highly contiguous assembly (Fig. 2B).Assembly statistics are reported in Table 2 and represented graphically in Fig. 2C.We have deposited the genome assembly on NCBI GenBank (see Table 2 and Data Availability for details).
Our assembly shows improved contiguity compared to other available assemblies for black bears (Table 3).Our primary assembly is represented in fewer contigs and scaffolds and has higher N50 statistics.The BUSCO scores for both our primary assembly and a previously assembled genome (GCF_020975775.1)are >95%, indicating that both assemblies are complete and single copies in these conserved regions of the genome.
We examined chromosome assignments and determined that our assembly is near-chromosome level.We identified scaffold JANIGQ010000001.1 in our assembly as the X chromosome based on synteny with the domestic dog genome.No scaffolds in our assembly had alignments to the domestic dog Y chromosome that matched our criteria of longer than 10 kb with greater than 80% identity.A handful of scaffolds had shorter alignments, indicating that the Y chromosome in our assembly is fragmented.We did not attempt to assign scaffolds in our assembly to autosomes based on the black bear karyotype (Nash and O'Brien 1987).However, we note that with a karyotype indicating 2n = 74 chromosomes, 92% of our assembly is contained in the 37 largest scaffolds (with the largest scaffold identified as the X chromosome), suggesting our assembly is near-chromosome level.
The final mitochondrial sequence has a size of 16,789 bp, with the base composition of A = 31.21%,C = 25.09%,G = 15.44%,T = 28.26%, and consisting of 2 rRNAs, 23 unique transfer RNAs, and 13 protein-coding genes.We evaluate the mitochondrial genome to be partial because while it is close to the expected size, the expected circularity is not supported.Additionally, while we annotated the expected number of rRNAs and protein-coding genes, the number of transfer RNAs differs from expected.The mitochondrial genome is scaffolded JANIGQ010000317.1 in our assembly.

Discussion
We generated a high-quality, near chromosome-level genome assembly for an American black bear from California.This new genome will serve as the foundation for landscape and population genomic analyses that will aid conservation decision-makers.Large mammals can serve as umbrella species, whose conservation can extend protections to other species in the same habitat, and healthy bear populations are often an indication of ecosystem health (Pelton et al. 1999).The genome assembly is a foundational component of studies on the effects of habitat loss and fragmentation on wildlife populations, particularly the impacts of local adaptation and inbreeding depression.
This new black bear genome assembly expands opportunities for pangenomic analyses within the species.Both previously assembled genomes are from the eastern United States, whereas our new genome is from the western United States, enabling comparisons to identify potentially adaptive genomic differences to different habitats and anthropogenic pressures.For example, it is known that hibernation length and coat color vary across the range of black bears (Gámez-Brunswick and Rojas-Soto 2020; Puckett et al. 2023).
This new black bear genome assembly also expands opportunities for comparative genomic analyses among bear species.There are 8 extant species of bears, and all of them have high-quality reference genomes available or in progress (Willey and Korstanje 2022;Beth Shapiro, personal communication).These bear species live in diverse habitats from the Arctic to the Tropics and survive on a variety of diets, including both generalist and specialist diets (Pelton et al. 1999).The availability of genome assemblies for species with divergent ecological pressures and phenotypes enables the identification of both coding and regulatory variation that may underlie ecologically important variation.The inclusion of additional individuals and/or species into taxonomically broad multi-species alignments, such as the Zoonomia alignment of placental mammals (Christmas et al. 2023), may be useful in identifying adaptations unique to bears, in addition to functional variation that may be important for black bear conservation.

Fig. 1 .
Fig. 1.The American black bear, Ursus americanus, is a widespread species that can be found in a variety of habitats, from dense forests to open grasslands.Photos from California Department of Fish and Wildlife (left, CC BY 2.0), Florida Fish and Wildlife Conservation Commission (top right, public domain), and David Wasserman (bottom right, CC BY-SA 4.0).

Fig. 2 .
Fig. 2. Visualization of assembly metrics.(A) K-mer frequencies from the adapter-trimmed PacBio HiFi data used to estimate genome size, sequencing error rate, and heterozygosity.The main peak at ~37× coverage corresponds to homozygous regions of the genome, while the slight peak at ~18× corresponds to heterozygous regions of the genome.The peak around zero corresponds to probable sequencing errors.(B) The omni-C contact map for the primary assembly after manual curation shows the 3D organization of the genome, with darker areas indicating closer proximity.(C) Snail plot displaying assembly metrics for the primary assembly.

Table 1 .
Assembly pipeline and software used.

Table 3 .
Comparison of genome assembly statistics.